About the project

Write a short description about the course and add a link to your GitHub repository here. This is an R Markdown (.Rmd) file so you can use R Markdown syntax.

My expectations and learning objectives

What do you expect to learn? Where did you hear about the course?

I hope to learn mostly modeling and clustering with R. This course also gives me the opportunity to refresh my memory on working with R and its functions. I heard about this course from our quantitative studies teacher Arho Toikka, who highly recommended it to me. Since I was already very interested in data science, this seemed like the perfect opportunity to improve my knowledge on the matter.


Week 2 Exercises

install.packages("dplyr", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
## 
## The downloaded binary packages are in
##  /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
## 
## The downloaded binary packages are in
##  /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
install.packages("GGally", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
## 
## The downloaded binary packages are in
##  /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa

1. Reading the data and describing it

analysisData <- read.table('data/learning2014')
summary(analysisData$Points)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    7.00   19.00   23.00   22.72   27.75   33.00

“This data was collected from social science students at the University of Helsinki. The survey was conducted in a statistics course for bachelor’s students in the social sciences, from December 2014 to January 2015. The data consists of 7 variables describing gender (factored as female or male), age, attitude (towards the study of statistics), deep learning (the student’s own attitudes and practices towards deep learning), surface learning and strategic learning, as well as the points received in the final exam of the course. The learning scales try to capture what kinds of learning strategies the students apply in their studies. All the learning variables are measured on Likert scales (values from 1 to 5); points, attitude and age are integer values. The survey was completed in Finnish.”

2. Graphical overview of the data

ggpairs(analysisData, lower = list(combo = wrap("facethist", bins = 20)))

Correlation between variables

“In our pairs plot we can see that the correlations between the variables are quite low. The only really notable correlations are between attitude and points (0.44) and between deep learning strategies and surface-level strategies (the correlation between the two strategies is negative).”

Distributions of the variables

“Almost all of the variables are normally distributed. The only exception is age, which is explained by the fact that the participants of the survey are students, who are young and mostly of the same age.”

Distribution of residuals

“In most cases we cannot see a clear linear pattern between two variables. Some pairs do show a pattern, like attitude and surface learning strategies, where the relationship is skewed towards negative values.”

3. Own linear regression model & 4. Summary of the fitted model

myRegressionModel <- lm(Points ~ Attitude + stra + surf, data = analysisData)
summary(myRegressionModel)
## 
## Call:
## lm(formula = Points ~ Attitude + stra + surf, data = analysisData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.01711    3.68375   2.991  0.00322 ** 
## Attitude     0.33952    0.05741   5.913 1.93e-08 ***
## stra         0.85313    0.54159   1.575  0.11716    
## surf        -0.58607    0.80138  -0.731  0.46563    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08

“Here I created a linear regression model in which the students’ attitude, strategic learning techniques and surface learning techniques try to explain the students’ achievement in the final exam of the course. In this first model we can see that surface and strategic learning have no significant effect on the students’ outcomes in the course exam. Their attitude towards the study of statistics, however, proves to be very significant.”

secondModel <- lm(Points ~ Attitude, data = analysisData)
summary(secondModel)
## 
## Call:
## lm(formula = Points ~ Attitude, data = analysisData)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

“In the second model we can see that the students’ attitudes have a highly significant association with the points they receive in the final exam: the probability of observing an effect this strong by chance alone is far less than one percent. In this model, one additional point in attitude increased the points a student got in the exam by about 0.35. The multiple R-squared means that about 19 percent of the variance in the exam points can be explained by differences in the students’ attitudes.”
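To make the slope concrete, predictions from the fitted model can be sketched like this (assuming the `secondModel` object fitted above; the attitude values are made up for illustration):

```r
# Predict exam points for a few illustrative attitude scores
# using the second model (Points ~ Attitude) fitted above.
new_students <- data.frame(Attitude = c(20, 30, 40))
predict(secondModel, newdata = new_students)
# By the coefficients above, each additional attitude point
# adds about 0.35 exam points to the prediction.
```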

5. Diagnostic plots

par(mfrow = c(2,2))
plot(secondModel, which = c(1, 2, 5))

“The model assumes that the errors are normally distributed and that their variance is constant. From the Residuals vs Fitted plot we can see no clear pattern in the residuals, and therefore we can conclude that the linearity assumption holds reasonably well. The variance also seems fairly constant, as the points do not stray too far from the mean. In the Normal Q-Q plot our data pretty much follows the line, which means that the errors are approximately normally distributed. In the Residuals vs Leverage plot we see that there aren’t really any outliers with a large leverage on the rest of the data. The data is a bit skewed down, but not significantly.”


Chapter 3 exercises

2) Reading joined student alcohol consumption data

library(dplyr)
library(ggplot2)
alc <- read.table(file = "data/alc", header = TRUE)
str(alc)
## 'data.frame':    382 obs. of  35 variables:
##  $ school    : Factor w/ 2 levels "GP","MS": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sex       : Factor w/ 2 levels "F","M": 1 1 1 1 1 2 2 1 2 2 ...
##  $ age       : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ address   : Factor w/ 2 levels "R","U": 2 2 2 2 2 2 2 2 2 2 ...
##  $ famsize   : Factor w/ 2 levels "GT3","LE3": 1 1 2 1 1 2 2 1 2 1 ...
##  $ Pstatus   : Factor w/ 2 levels "A","T": 1 2 2 2 2 2 2 1 1 2 ...
##  $ Medu      : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ Fedu      : int  4 1 1 2 3 3 2 4 2 4 ...
##  $ Mjob      : Factor w/ 5 levels "at_home","health",..: 1 1 1 2 3 4 3 3 4 3 ...
##  $ Fjob      : Factor w/ 5 levels "at_home","health",..: 5 3 3 4 3 3 3 5 3 3 ...
##  $ reason    : Factor w/ 4 levels "course","home",..: 1 1 3 2 2 4 2 2 2 2 ...
##  $ nursery   : Factor w/ 2 levels "no","yes": 2 1 2 2 2 2 2 2 2 2 ...
##  $ internet  : Factor w/ 2 levels "no","yes": 1 2 2 2 1 2 2 1 2 2 ...
##  $ guardian  : Factor w/ 3 levels "father","mother",..: 2 1 2 2 1 2 2 2 2 2 ...
##  $ traveltime: int  2 1 1 1 1 1 1 2 1 1 ...
##  $ studytime : int  2 2 2 3 2 2 2 2 2 2 ...
##  $ failures  : int  0 0 2 0 0 0 0 0 0 0 ...
##  $ schoolsup : Factor w/ 2 levels "no","yes": 2 1 2 1 1 1 1 2 1 1 ...
##  $ famsup    : Factor w/ 2 levels "no","yes": 1 2 1 2 2 2 1 2 2 2 ...
##  $ paid      : Factor w/ 2 levels "no","yes": 1 1 2 2 2 2 1 1 2 2 ...
##  $ activities: Factor w/ 2 levels "no","yes": 1 1 1 2 1 2 1 1 1 2 ...
##  $ higher    : Factor w/ 2 levels "no","yes": 2 2 2 2 2 2 2 2 2 2 ...
##  $ romantic  : Factor w/ 2 levels "no","yes": 1 1 1 2 1 1 1 1 1 1 ...
##  $ famrel    : int  4 5 4 3 4 5 4 4 4 5 ...
##  $ freetime  : int  3 3 3 2 3 4 4 1 2 5 ...
##  $ goout     : int  4 3 2 2 2 2 4 4 2 1 ...
##  $ Dalc      : int  1 1 2 1 1 1 1 1 1 1 ...
##  $ Walc      : int  1 1 3 1 2 2 1 1 1 1 ...
##  $ health    : int  3 3 3 5 5 5 3 1 1 5 ...
##  $ absences  : int  5 3 8 1 2 8 0 4 0 0 ...
##  $ G1        : int  2 7 10 14 8 14 12 8 16 13 ...
##  $ G2        : int  8 8 10 14 12 14 12 9 17 14 ...
##  $ G3        : int  8 8 11 14 12 14 12 10 18 14 ...
##  $ alc_use   : num  1 1 2.5 1 1.5 1.5 1 1 1 1 ...
##  $ high_use  : logi  FALSE FALSE TRUE FALSE FALSE FALSE ...
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

The student alcohol consumption data is based on the UCI data on student achievement in secondary education in two Portuguese schools. The variables include, for example, the students’ sex, grades, absences and health information. The data is combined from two datasets: one describing the students’ performance in Mathematics and one describing their performance in the Portuguese language. The combined dataset also shows the students’ grades from all three grading periods. Alcohol consumption is measured with the variable “alc_use” and high alcohol consumption with the variable “high_use”. If alc_use has a value over 2, the student is considered a heavy user of alcohol and the “high_use” variable receives the value TRUE.
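As a reminder of the data wrangling step (the actual script is not shown in this file, so this is only a sketch of how the two alcohol variables were presumably derived):

```r
library(dplyr)

# alc_use: average of workday (Dalc) and weekend (Walc) consumption,
# both on a 1-5 scale; high_use: TRUE when that average exceeds 2.
alc <- mutate(alc,
              alc_use  = (Dalc + Walc) / 2,
              high_use = alc_use > 2)
```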

3) Chosen variables for exploration in regards to their relationship with alcohol consumption

For my analysis I have chosen the variables sex, mother’s education (Medu), failures and absences as my indicators of alcohol consumption. I believe that a person’s sex, failures in past classes and absences from classes offer good explanatory power over a person’s alcohol consumption. I believe that drinking habits differ between boys and girls, and that misfortune and failure drive people to drinking; failures in courses might therefore also lead to absences. Together and separately, I still believe they lead to higher alcohol consumption. As my fourth variable I chose the level of the student’s mother’s education, based on the belief that educated parents warn their children more about alcohol and less often provide a model of heavy drinking. Higher education would therefore mean lower levels of alcohol consumption.

str(alc$Medu)
##  int [1:382] 4 1 1 4 3 4 2 4 3 3 ...
summary(alc$Medu)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.000   2.000   3.000   2.806   4.000   4.000

4) Graphical exploration of the chosen variables

alc$sex
##   [1] F F F F F M M F M M F F M M M F F F M M M M M M F F M M M M M M M M M
##  [36] F M M F F F M M M F F F M M F F F M F F F F M M F F F F F F F M F F F
##  [71] M M F M F M M F M M F M M F M F F F F M M F F F F M F M F F F M M M F
## [106] M F F M M M F M F F M M M M M M M M F M F M F M F F M F F F F M F M F
## [141] M F M M F F M M F F F M M M M M F M F M M F M M M M M F F F M M M F F
## [176] M F M M M M M M F F F M M M F M F F M M M F M M F F F F F F F F F F F
## [211] F F M F M F F F M F F F F F M F F F M M F F M M M M M M F F M M M M M
## [246] M F M M M M M M M M M M M F M M F F M M F F M M F M F F F F M F F F M
## [281] M F M M M F F F F M F F M M M F F M M F F F M F M F F M M F F F F F F
## [316] F F F M M M F F F M F F F F F F F F M M M F F F M M F M M M M M M F F
## [351] F M F F F F M M F F F M F F F F F F F M M M M M F F F F F F M M
## Levels: F M
alc$failures
##   [1] 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
##  [36] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
##  [71] 0 0 1 0 0 0 0 0 3 2 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [106] 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 2 0 1 0 0 0 0 0 0 1
## [141] 2 0 0 1 0 0 3 2 0 2 0 0 0 2 2 1 1 2 0 0 2 3 0 1 2 2 0 0 0 0 2 0 0 2 0
## [176] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 2 0 0
## [211] 0 0 0 0 0 0 0 1 0 0 0 0 0 0 2 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [246] 0 0 0 0 3 0 0 0 0 2 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [281] 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0
## [316] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2 0 0 1 0 0 0
## [351] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0 0 1 0 2 0 0 0
alc$absences
##   [1]  5  3  8  1  2  8  0  4  0  0  1  2  1  1  0  5  8  3  9  5  0  0  1
##  [24]  1  2 10  5  2  3 10  0  1  0  0  2  2  1  6  2  8 20  8  1  0 14  6
##  [47]  9  3  3  2  1  1  5  0  3  5  0  6  1  2  3  3  2  1  0  2  2  2  1
##  [70]  9  1  0  2  1 29  3  4  0  1 12 13  1  3  7  3  2  5  5  4  9 12  1
##  [93]  5  2  1  4  3  4  1  5  1 13  0  3 21  0 10  6  3  1  7  3  5  2  9
## [116] 10  6  4  9  3  3 17  4  1  6  2 11  0  1  0  1  8  0  3  7 14  1  2
## [139]  1  2  0  3  3  6  2  3  0 11  0  1  4  3  8  0  0  6  5  3  0  1 11
## [162]  6  3  2  4  0  2  0  0  0  2  1  0  0  2  2  1  4  9  4  7  3  1  0
## [185] 44 11  9  1  0  8  5  8  0 14  4  0  4  4 12 27  0  2  5  2 20  6 21
## [208]  4  7  4  3  7 14  0 12  9  2 19 12  4  2  1  6  1  4  2 10  6  1  8
## [231]  5  7  2  7  1 10  5  2 12  2  8 11  1  4  0  2  5  6  3 14  8  0  4
## [254]  5  4  1  0  1  4  8  5  1 12  2  2  3  0 10  1  8  7  6 16  7  2  2
## [277]  4  6 45 14 11  8  4 26 18  2  2  2  8  1  2  3  3  4  6  0  3  4  2
## [300]  1  5  0  2  7  0  0  0  0  0  1  6  1  1 18  8  2  0  0  2  2  5  4
## [323]  4  4  1  4  4  0  3 12  0  6  2  3  0  8  9  2  6  6  0  0  9  8  3
## [346]  4  4  4  0  3  1  3  0  3  2  0  2  0  0  0  5  3  8  9  0  3  2  0
## [369] 14  4  2  3  0  7  1  6  2  2  2  3  4  2
alc$high_use
##   [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
##  [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE
##  [23] FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE
##  [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
##  [45] FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE  TRUE
##  [56] FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE FALSE
##  [67]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE
##  [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
##  [89] FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [100] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [111]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [122]  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE
## [133] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
## [144] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE
## [155]  TRUE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE  TRUE FALSE
## [166]  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE
## [177]  TRUE  TRUE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE
## [188] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE
## [199]  TRUE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [210]  TRUE FALSE FALSE  TRUE FALSE  TRUE FALSE FALSE  TRUE  TRUE  TRUE
## [221] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE
## [232] FALSE FALSE FALSE  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE
## [243]  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE
## [254] FALSE  TRUE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [265] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [276] FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE  TRUE  TRUE FALSE
## [287] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE FALSE
## [298] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [309] FALSE FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [320]  TRUE  TRUE  TRUE FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE
## [331]  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE  TRUE FALSE  TRUE
## [342] FALSE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE
## [353] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [364] FALSE FALSE FALSE  TRUE FALSE  TRUE  TRUE FALSE FALSE FALSE  TRUE
## [375] FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
g1 <- ggplot(data = alc, aes(x = high_use, fill = sex))
g1 + geom_bar()

g2 <- ggplot(data = alc, aes(x=failures, fill = high_use)) 
g2 + geom_bar()

g3 <- ggplot(data = alc, aes(x = absences, fill = high_use))
g3 + geom_bar()

g4 <- ggplot(data = alc, aes(x=Medu, fill = high_use)) 
g4 + geom_bar()

The plots above show the relationships between high alcohol use and sex, failures, absences and mother’s education. The Medu variable is split into 5 levels from 0 to 4, with 0 being no education at all and 4 being higher education. In the education level plot we can’t see any strong correlation between high alcohol consumption and the student’s mother’s education level.

In the sex plot we can see that men tend to drink more than women. There seem to be more women than men in the dataset, but men are more heavily represented among the heavy drinkers.

In the remaining absences and failures plots the trend is that the more absences or failures a person has, the more they tend to drink. This can be seen more clearly in the absences variable; in the failures variable it is hard to say anything too conclusive due to the small sample sizes of the categories, as the students tend not to have many failing grades.
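The visual impression can also be checked numerically with a cross-tabulation; a small sketch using the functions already loaded:

```r
# Share of high users within each sex (rows sum to 1)
table(sex = alc$sex, high_use = alc$high_use) %>%
  prop.table(margin = 1) %>%
  round(2)
```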

5) Logistic regression model of designed variables

m <- glm(high_use ~ failures + absences + sex + Medu, data = alc, family = "binomial")
summary(m)
## 
## Call:
## glm(formula = high_use ~ failures + absences + sex + Medu, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.1839  -0.8371  -0.6004   1.1004   2.0266  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.916369   0.384473  -4.984 6.22e-07 ***
## failures     0.453431   0.199383   2.274  0.02296 *  
## absences     0.093094   0.023125   4.026 5.68e-05 ***
## sexM         0.939703   0.244341   3.846  0.00012 ***
## Medu         0.005038   0.116739   0.043  0.96558    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 424.40  on 377  degrees of freedom
## AIC: 434.4
## 
## Number of Fisher Scoring iterations: 4
# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                    OR      2.5 %    97.5 %
## (Intercept) 0.1471403 0.06747099 0.3057165
## failures    1.5737029 1.06754868 2.3457065
## absences    1.0975651 1.05122651 1.1511339
## sexM        2.5592205 1.59425355 4.1629212
## Medu        1.0050508 0.80004712 1.2657830

Above are the results of a logistic regression model with four explanatory variables (sex, failures, absences and mother’s education level) predicting high alcohol use. According to the model summary, absences and sex seem to have good explanatory power. Each additional failure increases the log-odds of being a high user of alcohol by about 0.45 (an odds ratio of about 1.57). For sex the only categories are men and women, and being male makes it considerably more likely that a student is a high user of alcohol.

From the 95 % confidence intervals we can see that the odds ratio for failures ranges from about 1.07 in the lower tail to about 2.35 in the upper tail. For absences the figures are 1.05 and 1.15, and for sex (male) 1.59 and 4.16. This means that, with 95 % confidence, the odds of high alcohol consumption are multiplied by somewhere between about 1.07 and 2.35 for each additional failure.

6) Predictive power of the model

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.67801047 0.02356021 0.70157068
##    TRUE  0.21989529 0.07853403 0.29842932
##    Sum   0.89790576 0.10209424 1.00000000
g <- ggplot(alc, aes(x = probability, y = high_use, col = prediction))
g + geom_point()

# define a loss function (average prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# compute the average number of wrong predictions in the (training) data
loss_func(alc$high_use, alc$probability)
## [1] 0.2434555
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2617801

From the predictions above we can conclude that our model has about a 24 % probability of being incorrect; it therefore correctly predicts heavy alcohol consumption in about 76 % of the cases. The model predicts too few students to be heavy drinkers: it classifies almost 90 % of the cases as non-heavy consumers of alcohol, when in fact only about 70 % are. The model therefore gives too optimistic a picture of alcohol consumption among students.
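For context, the model can be compared against a trivial baseline that always guesses FALSE; a quick sketch using the loss function defined above:

```r
# A classifier that never predicts high use is wrong exactly as often
# as high users occur in the data (about 30 % of cases).
loss_func(class = alc$high_use, prob = rep(0, nrow(alc)))

# Both the training error (about 0.24) and the cross-validation
# error (about 0.26) of our model beat this naive baseline.
```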


Chapter 4 exercises - classification and clustering

1) Loading the data into the Rmd file

library(tidyverse)
## ── Attaching packages ───────────────────────────────────────────────────────────────────────────── tidyverse 1.3.0 ──
## ✔ tibble  2.1.3     ✔ purrr   0.3.3
## ✔ tidyr   1.0.0     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ──────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(corrplot)
## corrplot 0.84 loaded
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(ggplot2)
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
?Boston

2) Structure and dimensions of the dataset

In this exercise we will be using the Housing Values in Suburbs of Boston dataset. The dataset has 506 observations and 14 variables. The variables are:

crim = crime rate per capita, zn = proportion of residential land zoned for large lots, indus = proportion of non-retail business (industry) acres per town, black = proportion of blacks by town, and other variables describing the population and housing situation in the Boston suburbs. Here we will be focusing mostly on the crime rate.

3) Summaries of the variables of the data

I will now use pairs and the correlation plot to describe the relationships between variables.

pairs(Boston)

# calculate the correlation matrix and round it
cor_matrix<-cor(Boston) 

# print the correlation matrix
cor_matrix
##                crim          zn       indus         chas         nox
## crim     1.00000000 -0.20046922  0.40658341 -0.055891582  0.42097171
## zn      -0.20046922  1.00000000 -0.53382819 -0.042696719 -0.51660371
## indus    0.40658341 -0.53382819  1.00000000  0.062938027  0.76365145
## chas    -0.05589158 -0.04269672  0.06293803  1.000000000  0.09120281
## nox      0.42097171 -0.51660371  0.76365145  0.091202807  1.00000000
## rm      -0.21924670  0.31199059 -0.39167585  0.091251225 -0.30218819
## age      0.35273425 -0.56953734  0.64477851  0.086517774  0.73147010
## dis     -0.37967009  0.66440822 -0.70802699 -0.099175780 -0.76923011
## rad      0.62550515 -0.31194783  0.59512927 -0.007368241  0.61144056
## tax      0.58276431 -0.31456332  0.72076018 -0.035586518  0.66802320
## ptratio  0.28994558 -0.39167855  0.38324756 -0.121515174  0.18893268
## black   -0.38506394  0.17552032 -0.35697654  0.048788485 -0.38005064
## lstat    0.45562148 -0.41299457  0.60379972 -0.053929298  0.59087892
## medv    -0.38830461  0.36044534 -0.48372516  0.175260177 -0.42732077
##                  rm         age         dis          rad         tax
## crim    -0.21924670  0.35273425 -0.37967009  0.625505145  0.58276431
## zn       0.31199059 -0.56953734  0.66440822 -0.311947826 -0.31456332
## indus   -0.39167585  0.64477851 -0.70802699  0.595129275  0.72076018
## chas     0.09125123  0.08651777 -0.09917578 -0.007368241 -0.03558652
## nox     -0.30218819  0.73147010 -0.76923011  0.611440563  0.66802320
## rm       1.00000000 -0.24026493  0.20524621 -0.209846668 -0.29204783
## age     -0.24026493  1.00000000 -0.74788054  0.456022452  0.50645559
## dis      0.20524621 -0.74788054  1.00000000 -0.494587930 -0.53443158
## rad     -0.20984667  0.45602245 -0.49458793  1.000000000  0.91022819
## tax     -0.29204783  0.50645559 -0.53443158  0.910228189  1.00000000
## ptratio -0.35550149  0.26151501 -0.23247054  0.464741179  0.46085304
## black    0.12806864 -0.27353398  0.29151167 -0.444412816 -0.44180801
## lstat   -0.61380827  0.60233853 -0.49699583  0.488676335  0.54399341
## medv     0.69535995 -0.37695457  0.24992873 -0.381626231 -0.46853593
##            ptratio       black      lstat       medv
## crim     0.2899456 -0.38506394  0.4556215 -0.3883046
## zn      -0.3916785  0.17552032 -0.4129946  0.3604453
## indus    0.3832476 -0.35697654  0.6037997 -0.4837252
## chas    -0.1215152  0.04878848 -0.0539293  0.1752602
## nox      0.1889327 -0.38005064  0.5908789 -0.4273208
## rm      -0.3555015  0.12806864 -0.6138083  0.6953599
## age      0.2615150 -0.27353398  0.6023385 -0.3769546
## dis     -0.2324705  0.29151167 -0.4969958  0.2499287
## rad      0.4647412 -0.44441282  0.4886763 -0.3816262
## tax      0.4608530 -0.44180801  0.5439934 -0.4685359
## ptratio  1.0000000 -0.17738330  0.3740443 -0.5077867
## black   -0.1773833  1.00000000 -0.3660869  0.3334608
## lstat    0.3740443 -0.36608690  1.0000000 -0.7376627
## medv    -0.5077867  0.33346082 -0.7376627  1.0000000
cor_matrix %>% round(digits = 2)
##          crim    zn indus  chas   nox    rm   age   dis   rad   tax
## crim     1.00 -0.20  0.41 -0.06  0.42 -0.22  0.35 -0.38  0.63  0.58
## zn      -0.20  1.00 -0.53 -0.04 -0.52  0.31 -0.57  0.66 -0.31 -0.31
## indus    0.41 -0.53  1.00  0.06  0.76 -0.39  0.64 -0.71  0.60  0.72
## chas    -0.06 -0.04  0.06  1.00  0.09  0.09  0.09 -0.10 -0.01 -0.04
## nox      0.42 -0.52  0.76  0.09  1.00 -0.30  0.73 -0.77  0.61  0.67
## rm      -0.22  0.31 -0.39  0.09 -0.30  1.00 -0.24  0.21 -0.21 -0.29
## age      0.35 -0.57  0.64  0.09  0.73 -0.24  1.00 -0.75  0.46  0.51
## dis     -0.38  0.66 -0.71 -0.10 -0.77  0.21 -0.75  1.00 -0.49 -0.53
## rad      0.63 -0.31  0.60 -0.01  0.61 -0.21  0.46 -0.49  1.00  0.91
## tax      0.58 -0.31  0.72 -0.04  0.67 -0.29  0.51 -0.53  0.91  1.00
## ptratio  0.29 -0.39  0.38 -0.12  0.19 -0.36  0.26 -0.23  0.46  0.46
## black   -0.39  0.18 -0.36  0.05 -0.38  0.13 -0.27  0.29 -0.44 -0.44
## lstat    0.46 -0.41  0.60 -0.05  0.59 -0.61  0.60 -0.50  0.49  0.54
## medv    -0.39  0.36 -0.48  0.18 -0.43  0.70 -0.38  0.25 -0.38 -0.47
##         ptratio black lstat  medv
## crim       0.29 -0.39  0.46 -0.39
## zn        -0.39  0.18 -0.41  0.36
## indus      0.38 -0.36  0.60 -0.48
## chas      -0.12  0.05 -0.05  0.18
## nox        0.19 -0.38  0.59 -0.43
## rm        -0.36  0.13 -0.61  0.70
## age        0.26 -0.27  0.60 -0.38
## dis       -0.23  0.29 -0.50  0.25
## rad        0.46 -0.44  0.49 -0.38
## tax        0.46 -0.44  0.54 -0.47
## ptratio    1.00 -0.18  0.37 -0.51
## black     -0.18  1.00 -0.37  0.33
## lstat      0.37 -0.37  1.00 -0.74
## medv      -0.51  0.33 -0.74  1.00
# visualize the correlation matrix
corrplot(cor_matrix, method="circle", type = "upper", cl.pos = "b", tl.pos = "d", tl.cex = 0.6)

There are many pairs of variables with clear positive or negative correlations. As the correlation plot shows, the strongest positive correlation is between the index of accessibility to radial highways (rad) and the full-value property-tax rate per $10,000 (tax). Strong negative correlations appear between, for example, the proportion of owner-occupied units built prior to 1940 (age) and the weighted mean of distances to the Boston employment centres (dis). Red dots convey a negative correlation and blue dots a positive one; the bigger the circle and the darker the colour, the stronger the correlation.
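The strongest pairs can also be listed programmatically instead of read off the plot; a minimal sketch, assuming the MASS package (which provides the Boston data) is installed:

```r
library(MASS)

# keep each pair only once by blanking the upper triangle and the diagonal
cm <- cor(Boston)
cm[upper.tri(cm, diag = TRUE)] <- NA

# report every pair whose absolute correlation exceeds 0.7
strong <- which(abs(cm) > 0.7, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[strong[, 1]],
           var2 = colnames(cm)[strong[, 2]],
           cor  = round(cm[strong], 2))
```

The rad/tax pair tops this list, matching what the corrplot circles suggest.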

4) Standardizing the data and summary of scaled data

# center and standardize variables
boston_scaled <- scale(Boston)

# summaries of the scaled variables
summary(boston_scaled)
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
# class of the boston_scaled object
class(boston_scaled)
## [1] "matrix"
# change the object to data frame
boston_scaled <- as.data.frame(boston_scaled)

# create a quantile vector of crim and print it
bins <- quantile(boston_scaled$crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
# create a categorical variable 'crime'
crime <- cut(boston_scaled$crim, labels = c("low", "med_low", "med_high", "high"), breaks = c(bins), include.lowest = TRUE)

# look at the table of the new factor crime
table(crime)
## crime
##      low  med_low med_high     high 
##      127      126      126      127
# remove original crim from the dataset
boston_scaled <- dplyr::select(boston_scaled, -crim)

# add the new categorical value to scaled data
boston_scaled <- data.frame(boston_scaled, crime)

# number of rows in the Boston dataset 
n <- nrow(boston_scaled)

# choose randomly 80% of the rows
ind <- sample(n,  size = n * 0.8)

# create train set
train <- boston_scaled[ind,]

# create test set 
test <- boston_scaled[-ind,]

Here our goal is to standardize the data for further analysis, so we scale it: scaling subtracts the column mean from each value and divides by the column's standard deviation. We then cut the crim variable into quantiles and label them, so the crime rate of each Boston suburb becomes a categorical variable. Next we drop the old crime-rate variable from the data frame and add the new categorical one in its place. Finally we randomly split the data so that 80% of the rows form the training set and the remaining 20% form the test set for our classification model.
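As a sanity check on the scaling step, one column can be standardized by hand and compared with the output of scale(); a minimal sketch using the crim column:

```r
library(MASS)

# scale() subtracts the column mean and divides by the column sd
x <- Boston$crim
manual <- (x - mean(x)) / sd(x)
all.equal(as.vector(scale(Boston)[, "crim"]), manual)  # TRUE
```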

5) Fitting the linear discriminant analysis

# linear discriminant analysis
lda.fit <- lda(crime ~ ., data = train)

# print the lda.fit object
lda.fit
## Call:
## lda(crime ~ ., data = train)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2747525 0.2277228 0.2425743 0.2549505 
## 
## Group means:
##                  zn      indus        chas        nox          rm
## low       1.0039957 -0.9267934 -0.09498221 -0.9092573  0.46606422
## med_low  -0.1234836 -0.2025559  0.02723291 -0.5375839 -0.13775149
## med_high -0.3804849  0.1313309  0.12941585  0.4068228  0.04673242
## high     -0.4872402  1.0170891 -0.08120770  1.0693742 -0.39463194
##                 age        dis        rad        tax     ptratio
## low      -0.9257136  0.9285967 -0.7004450 -0.7526518 -0.44303064
## med_low  -0.2598340  0.3295556 -0.5586861 -0.4608595  0.01351028
## med_high  0.3904025 -0.3562795 -0.4017782 -0.3079465 -0.22596738
## high      0.8088362 -0.8437082  1.6384176  1.5142626  0.78111358
##                black       lstat        medv
## low       0.37345701 -0.76780926  0.56035901
## med_low   0.35797874 -0.08499805 -0.03512233
## med_high  0.07083964  0.06008625  0.10072490
## high     -0.80242024  0.87625939 -0.70376633
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.07734481  0.68749803 -0.70745863
## indus   -0.02930265 -0.16297561  0.62118690
## chas    -0.09549043 -0.05862914  0.15399036
## nox      0.36829633 -0.69151781 -1.55006461
## rm      -0.13258567 -0.06090554 -0.12620716
## age      0.25392857 -0.35342625  0.02293232
## dis     -0.07962346 -0.22509412  0.19045764
## rad      3.11361209  0.92763814  0.08822090
## tax      0.07267982  0.01264594  0.33262317
## ptratio  0.10508628 -0.01753996 -0.24370129
## black   -0.13147599 -0.02171838  0.15605126
## lstat    0.18549279 -0.23905066  0.29828563
## medv     0.18393874 -0.32436819 -0.25703563
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9497 0.0373 0.0130
# target classes as numeric
classes <- as.numeric(train$crime)

# plot the lda results
plot(lda.fit, dimen = 2, col = classes, pch = classes)

Here we fit a linear discriminant analysis model with the crime-rate categories as the target variable, using all of the other variables as predictors.
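The same call pattern works for any data frame with a factor target; a small, reproducible illustration of MASS::lda on the built-in iris data (not part of the assignment data):

```r
library(MASS)

# fit LDA with Species as the target and all other columns as predictors
fit <- lda(Species ~ ., data = iris)

fit$prior                   # prior probabilities of groups, as printed above for crime
fit$svd^2 / sum(fit$svd^2)  # proportion of trace, i.e. discriminative power per LD
```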

6) Predicting classes with the linear discriminant analysis - model

# save the correct classes from test data
correct_classes <- test$crime

# remove the crime variable from test data
test <- dplyr::select(test, -crime)

# predict classes with test data
lda.pred <- predict(lda.fit, newdata = test)

# cross tabulate the results
table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       10       5        1    0
##   med_low   10      17        7    0
##   med_high   0       8       19    1
##   high       0       0        0   24

Here we cross-tabulate the model's predictions against the correct classes of the test data. Our model was quite good at predicting high-crime-rate areas. However, it is prone to misclassifying medium-high areas as medium-low, medium-low areas as low, and low areas as medium-low. The higher the true crime rate of an area, the better the model's predictive power.
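One convenient summary of the cross-tabulation is the overall accuracy, i.e. the share of observations on the diagonal. A sketch using the counts printed above (the exact numbers vary between runs because the train/test split is random):

```r
# confusion matrix copied from the cross-tabulation above
conf <- matrix(c(10,  5,  1,  0,
                 10, 17,  7,  0,
                  0,  8, 19,  1,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                               predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correct predictions divided by all predictions
sum(diag(conf)) / sum(conf)  # about 0.69
```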

7) Using k-means to cluster the observations

library(MASS)

# Scale the variables to get comparable distances
boston_scaled2 <- scale(Boston)

# k-means clustering on the scaled data
km <- kmeans(boston_scaled2, centers = 3)

# plot the scaled Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

set.seed(123)

# determine the number of clusters
k_max <- 10

# calculate the total within sum of squares for each number of clusters
twcss <- sapply(1:k_max, function(k){kmeans(boston_scaled2, k)$tot.withinss})

# visualize the results
qplot(x = 1:k_max, y = twcss, geom = 'line')

# k-means clustering with the chosen number of clusters
km <- kmeans(boston_scaled2, centers = 2)

# plot the scaled Boston dataset with clusters
pairs(boston_scaled2, col = km$cluster)

# plot the clusters in smaller sets of variables
pairs(boston_scaled2[, 1:5], col = km$cluster)

pairs(boston_scaled2[, 6:10], col = km$cluster)

pairs(boston_scaled2[, 11:14], col = km$cluster)

As we can see from the plot of the total within-cluster sum of squares (WCSS), the value drops dramatically when we move from one cluster to two. Adding further clusters changes the value much less. Therefore it seems most reasonable to continue clustering with k-means using two clusters.
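The "elbow" can also be inspected numerically by looking at the relative change in total WCSS when one more cluster is added; a minimal sketch on the scaled Boston data, with a fixed seed for reproducibility:

```r
library(MASS)

set.seed(123)
# total within-cluster sum of squares for k = 1..5 clusters
twcss <- sapply(1:5, function(k) kmeans(scale(Boston), k)$tot.withinss)

# relative change at each step; the big drop typically comes from one cluster to two
round(diff(twcss) / twcss[-length(twcss)], 2)
```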

In the example above I plotted the variables against each other using the pairs() function. The pairs are shown in smaller sets, from the first variable to the fifth, the sixth to the tenth, and the eleventh to the fourteenth (there are only 14 variables in the data), to help with the visualization. In many cases the cluster boundaries are quite unclear, and the clusters mostly separate the observations into "low" and "high" categories.


Exercise 5

Loading packages and the human-data into the file

install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("tidyr", repos = "http://cran.us.r-project.org")
install.packages("stringr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("GGally", repos = "http://cran.us.r-project.org")
install.packages("corrplot", repos = "http://cran.us.r-project.org")
install.packages("FactoMineR", repos = "http://cran.us.r-project.org")

library(dplyr)
library(tidyr)
library(stringr)
library(ggplot2)
library(GGally)
library(corrplot)

human <- read.table("/Users/Lotta/IODS-project/data/human", header = TRUE)

1) Visualizing the human variables and explaining the structure

# visualize the 'human' variables
ggpairs(human)

str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ eduDiff        : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ workDiff       : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ lifeExpectancy : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ eduExpectancy  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNICapita      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ motherMortality: int  4 6 6 5 6 7 9 28 11 8 ...
##  $ teenMoms       : num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ parliamentRep  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
summary(human)
##     eduDiff          workDiff      lifeExpectancy  eduExpectancy  
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##    GNICapita      motherMortality     teenMoms      parliamentRep  
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
# compute the correlation matrix and visualize it with corrplot
cor(human) %>% corrplot

We have a dataset consisting of 155 observations and 8 variables. In the plots, and especially the correlation plot, we can see a strong positive correlation between life expectancy and expected years of education. Other notable correlations are:

Maternal mortality and life expectancy (negative correlation). GNI per capita with life expectancy and expected years of education (positive correlation). GNI per capita with maternal mortality and the adolescent birth rate (small negative correlation).

2) Principal component analysis (PCA)

# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human)

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped

3) Standardize the dataset and perform PCA again

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     eduDiff           workDiff       lifeExpectancy    eduExpectancy    
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##    GNICapita       motherMortality      teenMoms       parliamentRep    
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)
pca_human
## Standard deviations (1, .., p=8):
## [1] 2.0708380 1.1397204 0.8750485 0.7788630 0.6619563 0.5363061 0.4589994
## [8] 0.3222406
## 
## Rotation (n x k) = (8 x 8):
##                         PC1         PC2         PC3         PC4        PC5
## eduDiff         -0.35664370  0.03796058 -0.24223089  0.62678110 -0.5983585
## workDiff         0.05457785  0.72432726 -0.58428770  0.06199424  0.2625067
## lifeExpectancy  -0.44372240 -0.02530473  0.10991305 -0.05834819  0.1628935
## eduExpectancy   -0.42766720  0.13940571 -0.07340270 -0.07020294  0.1659678
## GNICapita       -0.35048295  0.05060876 -0.20168779 -0.72727675 -0.4950306
## motherMortality  0.43697098  0.14508727 -0.12522539 -0.25170614 -0.1800657
## teenMoms         0.41126010  0.07708468  0.01968243  0.04986763 -0.4672068
## parliamentRep   -0.08438558  0.65136866  0.72506309  0.01396293 -0.1523699
##                         PC6         PC7         PC8
## eduDiff          0.17713316  0.05773644  0.16459453
## workDiff        -0.03500707 -0.22729927 -0.07304568
## lifeExpectancy  -0.42242796 -0.43406432  0.62737008
## eduExpectancy   -0.38606919  0.77962966 -0.05415984
## GNICapita        0.11120305 -0.13711838 -0.16961173
## motherMortality  0.17370039  0.35380306  0.72193946
## teenMoms        -0.76056557 -0.06897064 -0.14335186
## parliamentRep    0.13749772  0.00568387 -0.02306476
# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.8, 1), col = c("grey40", "deeppink2"))

# create and print out a summary of pca_human
s <- summary(pca_human)

# rounded percentages of variance captured by each PC
pca_pr <- round(100*s$importance[2,], digits = 1) 

# print out the percentages of variance
pca_pr
##  PC1  PC2  PC3  PC4  PC5  PC6  PC7  PC8 
## 53.6 16.2  9.6  7.6  5.5  3.6  2.6  1.3
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot
biplot(pca_human, cex = c(0.8, 1), col = c("grey40", "deeppink2"), xlab = pc_lab[1], ylab = pc_lab[2])

When we did not scale our variables, the principal component analysis was dominated by the variables with the largest absolute variances, above all GNI per capita, which is measured in tens of thousands while the other variables range over much smaller intervals. Since PCA maximizes the variance each component captures, the unscaled analysis essentially reproduced that one variable and told us little about the rest. After scaling, every variable has unit variance and contributes on an equal footing, so the components reflect genuine structure in the data.
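The effect is easy to demonstrate on a toy data frame where one column has a far larger variance than the other (the column names small and large are invented for the illustration):

```r
set.seed(1)
toy <- data.frame(small = rnorm(100, sd = 1),
                  large = rnorm(100, sd = 1000))

# unscaled: the first component is essentially just the high-variance column
round(abs(prcomp(toy)$rotation[, "PC1"]), 2)

# scaled: both columns load on the first component equally
round(abs(prcomp(scale(toy))$rotation[, "PC1"]), 2)
```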

4)

We can see that the first principal component (dimension) explains 53.6% of the total variance of our variables, and the second principal component explains 16.2%. Our analysis was therefore able to separate two strong principal components that are uncorrelated with each other.
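These percentages come straight from the component standard deviations: each component's share is its variance (sdev squared) divided by the total variance. A sketch on the scaled Boston data, since the human data is loaded from a local file:

```r
library(MASS)

pca  <- prcomp(scale(Boston))
prop <- pca$sdev^2 / sum(pca$sdev^2)  # variance share of each component

round(100 * prop, 1)  # percentages like those used on the biplot axis labels
sum(prop)             # the shares always add up to 1
```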

5) Visualize the variables in barplots

library(FactoMineR)
data(tea)


str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
# column names to keep in the dataset
keep_columns <- c("Tea", "How", "how", "sugar", "where", "lunch")

# select the 'keep_columns' to create a new dataset
tea_time <- dplyr::select(tea, one_of(keep_columns))

# look at the summaries and structure of the data
summary(tea_time)
##         Tea         How                      how           sugar    
##  black    : 74   alone:195   tea bag           :170   No.sugar:155  
##  Earl Grey:193   lemon: 33   tea bag+unpackaged: 94   sugar   :145  
##  green    : 33   milk : 63   unpackaged        : 36                 
##                  other:  9                                          
##                   where           lunch    
##  chain store         :192   lunch    : 44  
##  chain store+tea shop: 78   Not.lunch:256  
##  tea shop            : 30                  
## 
str(tea_time)
## 'data.frame':    300 obs. of  6 variables:
##  $ Tea  : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How  : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how  : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ sugar: Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where: Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ lunch: Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 45, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables;
## they will be dropped

From the barplots we can see the most common answers: tea is most often drunk without sugar, as Earl Grey, bought from a chain store, brewed from a tea bag, and at times other than lunch.

Performing the multiple correspondence analysis (MCA)

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.279   0.261   0.219   0.189   0.177   0.156
## % of var.             15.238  14.232  11.964  10.333   9.667   8.519
## Cumulative % of var.  15.238  29.471  41.435  51.768  61.434  69.953
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11
## Variance               0.144   0.141   0.117   0.087   0.062
## % of var.              7.841   7.705   6.392   4.724   3.385
## Cumulative % of var.  77.794  85.500  91.891  96.615 100.000
## 
## Individuals (the 10 first)
##                       Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3
## 1                  | -0.298  0.106  0.086 | -0.328  0.137  0.105 | -0.327
## 2                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 3                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 4                  | -0.530  0.335  0.460 | -0.318  0.129  0.166 |  0.211
## 5                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 6                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 7                  | -0.369  0.162  0.231 | -0.300  0.115  0.153 | -0.202
## 8                  | -0.237  0.067  0.036 | -0.136  0.024  0.012 | -0.695
## 9                  |  0.143  0.024  0.012 |  0.871  0.969  0.435 | -0.067
## 10                 |  0.476  0.271  0.140 |  0.687  0.604  0.291 | -0.650
##                       ctr   cos2  
## 1                   0.163  0.104 |
## 2                   0.735  0.314 |
## 3                   0.062  0.069 |
## 4                   0.068  0.073 |
## 5                   0.062  0.069 |
## 6                   0.062  0.069 |
## 7                   0.062  0.069 |
## 8                   0.735  0.314 |
## 9                   0.007  0.003 |
## 10                  0.643  0.261 |
## 
## Categories (the 10 first)
##                        Dim.1     ctr    cos2  v.test     Dim.2     ctr
## black              |   0.473   3.288   0.073   4.677 |   0.094   0.139
## Earl Grey          |  -0.264   2.680   0.126  -6.137 |   0.123   0.626
## green              |   0.486   1.547   0.029   2.952 |  -0.933   6.111
## alone              |  -0.018   0.012   0.001  -0.418 |  -0.262   2.841
## lemon              |   0.669   2.938   0.055   4.068 |   0.531   1.979
## milk               |  -0.337   1.420   0.030  -3.002 |   0.272   0.990
## other              |   0.288   0.148   0.003   0.876 |   1.820   6.347
## tea bag            |  -0.608  12.499   0.483 -12.023 |  -0.351   4.459
## tea bag+unpackaged |   0.350   2.289   0.056   4.088 |   1.024  20.968
## unpackaged         |   1.958  27.432   0.523  12.499 |  -1.015   7.898
##                       cos2  v.test     Dim.3     ctr    cos2  v.test  
## black                0.003   0.929 |  -1.081  21.888   0.382 -10.692 |
## Earl Grey            0.027   2.867 |   0.433   9.160   0.338  10.053 |
## green                0.107  -5.669 |  -0.108   0.098   0.001  -0.659 |
## alone                0.127  -6.164 |  -0.113   0.627   0.024  -2.655 |
## lemon                0.035   3.226 |   1.329  14.771   0.218   8.081 |
## milk                 0.020   2.422 |   0.013   0.003   0.000   0.116 |
## other                0.102   5.534 |  -2.524  14.526   0.197  -7.676 |
## tea bag              0.161  -6.941 |  -0.065   0.183   0.006  -1.287 |
## tea bag+unpackaged   0.478  11.956 |   0.019   0.009   0.000   0.226 |
## unpackaged           0.141  -6.482 |   0.257   0.602   0.009   1.640 |
## 
## Categorical variables (eta2)
##                      Dim.1 Dim.2 Dim.3  
## Tea                | 0.126 0.108 0.410 |
## How                | 0.076 0.190 0.394 |
## how                | 0.708 0.522 0.010 |
## sugar              | 0.065 0.001 0.336 |
## where              | 0.702 0.681 0.055 |
## lunch              | 0.000 0.064 0.111 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

Dimension one accounts for 15.24% of the variance of the variables assigned to it. The second dimension is not significantly less important, as it covers 14.23% of the variance. People who drink their tea from a tea bag most likely drink it bought from a chain store and with sugar; they also tend to prefer Earl Grey. Participants who prefer unpackaged tea usually go for green tea from a tea shop.


install.packages("dplyr", repos = "http://cran.us.r-project.org")
install.packages("ggplot2", repos = "http://cran.us.r-project.org")
install.packages("GGally", repos = "http://cran.us.r-project.org")

library(dplyr)
library(ggplot2)
library(GGally)

Exercise 6

Reading the tables into the file and factoring the categorical variables again

BPRS <- read.table("data/BPRS")
RATS <- read.table("data/RATS")
str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: int  1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
str(RATS)
## 'data.frame':    176 obs. of  5 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Group : int  1 1 1 1 1 1 1 1 2 2 ...
##  $ WD    : Factor w/ 11 levels "WD1","WD15","WD22",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ Weight: int  240 225 245 260 255 260 275 245 410 405 ...
##  $ Time  : int  1 1 1 1 1 1 1 1 1 1 ...
# Factor treatment & subject
BPRS$treatment <- factor(BPRS$treatment)
BPRS$subject <- factor(BPRS$subject)

# Factor variables ID and Group
RATS$ID <- factor(RATS$ID)
RATS$Group <- factor(RATS$Group)

MABS 8 analysis for the RATS-data

# Draw the plot
ggplot(RATS, aes(x = Time, y = Weight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  theme(legend.position = "none") + 
  scale_y_continuous(limits = c(min(RATS$Weight), max(RATS$Weight)))

### Unstandardized plot of the RATS data

Standardizing the RATS weight variable

RATS <- RATS %>%
  group_by(Time) %>%
  mutate(stdWeight = ((Weight - mean(Weight)) / sd(Weight))) %>%
  ungroup()
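The same per-time-point standardization can be checked in base R with ave(); after the transformation every time point has mean 0 and standard deviation 1 (toy numbers invented for the illustration):

```r
toy <- data.frame(time   = rep(1:3, each = 4),
                  weight = c(240, 250, 260, 245,
                             255, 265, 270, 250,
                             270, 280, 285, 260))

# standardize within each time point
toy$std <- ave(toy$weight, toy$time,
               FUN = function(x) (x - mean(x)) / sd(x))

tapply(toy$std, toy$time, mean)  # all (numerically) zero
tapply(toy$std, toy$time, sd)    # all one
```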

# Plot again with the standardised weight
ggplot(RATS, aes(x = Time, y = stdWeight, linetype = ID)) +
  geom_line() +
  scale_linetype_manual(values = rep(1:10, times=4)) +
  facet_grid(. ~ Group, labeller = label_both) +
  scale_y_continuous(name = "standardized weight")

Creating a summary graph for RATS (with standardized weight)

# Number of measurement days, baseline (day 1) included
n <- max(RATS$Time)
n
## [1] 64
# Summary data with mean and standard error of Weight by group and time
RATSL <- RATS %>%
  group_by(Group, Time) %>%
  summarise( mean = mean(Weight), se = (sd(Weight)/ sqrt(n) )) %>%
  ungroup()

# Glimpse the data
glimpse(RATSL)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2…
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, 29…
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 267.…
## $ se    <dbl> 1.902697, 1.636634, 1.434488, 1.700102, 1.382185, 1.472971…
# Plot the mean profiles
ggplot(RATSL, aes(x = Time, y = mean, linetype = Group, shape = Group)) +
  geom_line() +
  scale_linetype_manual(values = c(1,2,3)) +
  geom_point(size=3) +
  scale_shape_manual(values = c(1,2,3)) +
  geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3) +
  theme(legend.position = c(0.8,0.8)) +
  scale_y_continuous(name = "mean(Weight) +/- se(Weight)")

MABS analysis

# Create summary data by group and subject (ID) with mean as the summary variable (baseline Time = 1 included).
RATS8S <- RATS %>%
  filter(Time > 0) %>%
  group_by(Group, ID) %>%
  summarise( mean=mean(Weight) ) %>%
  ungroup()

# Glimpse the data
glimpse(RATS8S)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.7273…
RATS$Weight
##   [1] 240 225 245 260 255 260 275 245 410 405 445 555 470 535 520 510 250
##  [18] 230 250 255 260 265 275 255 415 420 445 560 465 525 525 510 255 230
##  [35] 250 255 255 270 260 260 425 430 450 565 475 530 530 520 260 232 255
##  [52] 265 270 275 270 268 428 440 452 580 485 533 540 515 262 240 262 265
##  [69] 270 275 273 270 438 448 455 590 487 535 543 530 258 240 265 268 273
##  [86] 277 274 265 443 460 455 597 493 540 546 538 266 243 267 270 274 278
## [103] 276 265 442 458 451 595 493 525 538 535 266 244 267 272 273 278 271
## [120] 267 446 464 450 595 504 530 544 542 265 238 264 274 276 284 282 273
## [137] 456 475 462 612 507 543 553 550 272 247 268 273 278 279 281 274 468
## [154] 484 466 618 518 544 555 553 278 245 269 275 280 281 284 278 478 496
## [171] 472 628 525 559 548 569
# Draw a boxplot of the mean Weight versus group
ggplot(RATS8S, aes(x = Group, y = mean)) +
  geom_boxplot() + ggtitle("Boxplot of weights in groups") +
  stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white") +
  scale_y_continuous(name = "mean(Weight), days 1-64")

Here we can see that each of the three groups has an outlier. In groups 1 and 3 the outlier drags the group's mean weight down, while in group 2 it pulls it up.
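A natural follow-up step would be to drop these outliers and redraw the boxplot. A minimal sketch, with cutoffs read off the boxplot by eye (the exact thresholds are illustrative, not computed):

```r
# Sketch: remove the per-group outliers visible in the boxplot.
# The cutoffs are eyeballed from the plot and may need adjusting.
RATS8S1 <- RATS8S %>%
  filter(!(Group == 1 & mean < 250),
         !(Group == 2 & mean > 550),
         !(Group == 3 & mean < 500))
```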

MABS chapter 9 analysis for the BPRS data

str(BPRS)
## 'data.frame':    360 obs. of  5 variables:
##  $ treatment: Factor w/ 2 levels "1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ subject  : Factor w/ 20 levels "1","2","3","4",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ weeks    : Factor w/ 9 levels "week0","week1",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ bprs     : int  42 58 54 55 72 48 71 30 41 57 ...
##  $ week     : int  0 0 0 0 0 0 0 0 0 0 ...
# Plot the BPRS - data
ggplot(BPRS, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype = subject)) + scale_x_continuous(name = "Week", breaks = seq(0, 8, 1))

# create a regression model BPRS_reg
BPRS_reg <- lm(bprs ~ week + treatment, data=BPRS)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ week + treatment, data = BPRS)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

Week is a significant predictor of the psychiatric scale score: on average the BPRS score drops by about 2.27 points per week of treatment, while the treatment group itself has no significant effect. Note, however, that this linear model assumes the observations are independent, which is not true for repeated measures on the same subjects.
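To see the uncertainty around these estimates, the confidence intervals of the fitted model can be printed. Because `lm()` ignores the within-subject correlation, these intervals are likely too narrow:

```r
# 95% confidence intervals for the fixed-effect estimates of BPRS_reg.
# lm() treats all 360 observations as independent, so interpret with caution.
confint(BPRS_reg)
```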

install.packages("lme4", repos = "http://cran.us.r-project.org")
## Installing package into '/Users/Lotta/Library/R/3.6/library'
## (as 'lib' is unspecified)
## 
## The downloaded binary packages are in
##  /var/folders/1y/kp4b3mz525sg68ltbzkcnf200000gn/T//RtmpVqKiAl/downloaded_packages
library(lme4)
## Loading required package: Matrix
## 
## Attaching package: 'Matrix'
## The following objects are masked from 'package:tidyr':
## 
##     expand, pack, unpack
# Create a random intercept model. Note: the grouping factor here is week;
# the corresponding MABS exercise uses (1 | subject), and the near-zero
# random-effect variance below is why lmer reports a singular fit.
BPRS_ref <- lmer(bprs ~ week + treatment + (1 | week), data = BPRS, REML = FALSE)
## boundary (singular) fit: see ?isSingular
# Print the summary of the model
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (1 | week)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2839.3   2858.8  -1414.7   2829.3      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -1.8235 -0.7281 -0.2596  0.5687  4.0804 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev. 
##  week     (Intercept) 5.648e-16 2.377e-08
##  Residual             1.516e+02 1.231e+01
## Number of obs: 360, groups:  week, 9
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.3613  34.124
## week         -2.2704     0.2513  -9.033
## treatment2    0.5722     1.2980   0.441
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.739       
## treatment2 -0.477  0.000
## convergence code: 0
## boundary (singular) fit: see ?isSingular
# create a random intercept and random slope model
BPRS_ref1 <- lmer(bprs ~ week + treatment + (week | subject), data = BPRS, REML = FALSE)

# print a summary of the model
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject)
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7977 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8222  8.0512        
##           week         0.9609  0.9803   -0.51
##  Residual             97.4304  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) week  
## week       -0.582       
## treatment2 -0.247  0.000
# perform an ANOVA test on the two models
anova(BPRS_ref1, BPRS_ref)
## Data: BPRS
## Models:
## BPRS_ref: bprs ~ week + treatment + (1 | week)
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## BPRS_ref   5 2839.3 2858.8 -1414.7   2829.3                             
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 97.897      2  < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
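The ANOVA clearly favours the random intercept and slope model (lower AIC, p < 2.2e-16). The estimated subject-level deviations from the common intercept and slope can then be inspected directly:

```r
# Sketch: subject-specific intercept and slope deviations from BPRS_ref1
head(ranef(BPRS_ref1)$subject)
```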

Creating another random intercept and random slope model and plotting it

# Create a random intercept and random slope model with a week * subject
# interaction. Note: this makes subject both a random and a fixed effect;
# the MABS example instead adds a week * treatment interaction.
BPRS_ref2 <- lmer(bprs ~ week + treatment + (week | subject) + week * subject, data = BPRS, REML = FALSE)
## boundary (singular) fit: see ?isSingular
# print a summary of the model
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ week + treatment + (week | subject) + week * subject
##    Data: BPRS
## 
##      AIC      BIC   logLik deviance df.resid 
##   2717.7   2892.6  -1313.9   2627.7      315 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.9828 -0.6291 -0.0213  0.5850  4.0460 
## 
## Random effects:
##  Groups   Name        Variance  Std.Dev.  Corr 
##  subject  (Intercept) 1.650e-07 0.0004062      
##           week        1.663e-08 0.0001290 -1.00
##  Residual             8.661e+01 9.3061824      
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                Estimate Std. Error t value
## (Intercept)     44.0028     4.0742  10.800
## week             0.2333     0.8495   0.275
## treatment2       0.5722     0.9810   0.583
## subject2         3.2778     5.7199   0.573
## subject3         3.8333     5.7199   0.670
## subject4         4.0778     5.7199   0.713
## subject5        27.5778     5.7199   4.821
## subject6        -3.5778     5.7199  -0.625
## subject7        12.2444     5.7199   2.141
## subject8        -0.5778     5.7199  -0.101
## subject9        -2.9222     5.7199  -0.511
## subject10        8.2333     5.7199   1.439
## subject11       13.7556     5.7199   2.405
## subject12        3.7444     5.7199   0.655
## subject13        6.5556     5.7199   1.146
## subject14       -6.4222     5.7199  -1.123
## subject15       12.5222     5.7199   2.189
## subject16       -0.4444     5.7199  -0.078
## subject17       -4.8000     5.7199  -0.839
## subject18      -12.1444     5.7199  -2.123
## subject19       -8.6889     5.7199  -1.519
## subject20       -7.2222     5.7199  -1.263
## week:subject2   -3.6667     1.2014  -3.052
## week:subject3   -3.5417     1.2014  -2.948
## week:subject4   -2.1583     1.2014  -1.796
## week:subject5   -5.4917     1.2014  -4.571
## week:subject6   -2.3833     1.2014  -1.984
## week:subject7   -4.0333     1.2014  -3.357
## week:subject8   -0.5500     1.2014  -0.458
## week:subject9   -3.1583     1.2014  -2.629
## week:subject10  -2.3500     1.2014  -1.956
## week:subject11  -1.2167     1.2014  -1.013
## week:subject12  -2.9500     1.2014  -2.455
## week:subject13  -3.3750     1.2014  -2.809
## week:subject14  -1.7417     1.2014  -1.450
## week:subject15  -3.8667     1.2014  -3.218
## week:subject16  -3.0833     1.2014  -2.566
## week:subject17  -0.6750     1.2014  -0.562
## week:subject18  -0.9917     1.2014  -0.825
## week:subject19  -2.2583     1.2014  -1.880
## week:subject20  -2.5833     1.2014  -2.150
## 
## Correlation matrix not shown by default, as p = 41 > 12.
## Use print(x, correlation=TRUE)  or
##     vcov(x)        if you need it
## convergence code: 0
## boundary (singular) fit: see ?isSingular
# perform an ANOVA test on the two models
anova(BPRS_ref2, BPRS_ref1)
## Data: BPRS
## Models:
## BPRS_ref1: bprs ~ week + treatment + (week | subject)
## BPRS_ref2: bprs ~ week + treatment + (week | subject) + week * subject
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)    
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                             
## BPRS_ref2 45 2717.7 2892.6 -1313.9   2627.7 103.72     38   5.22e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# Draw the plot of BPRS with the observed bprs values
ggplot(BPRS, aes(x = week, y = bprs, group = subject)) +
  geom_line(aes(linetype = subject)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Observed psychiatric scale values") +
  theme(legend.position = "top")

# Create a vector of the fitted values
Fitted <- fitted(BPRS_ref2)

# Create a new column Fitted in BPRS
BPRS <- BPRS %>% mutate(Fitted)

# Draw the plot of BPRS with the fitted bprs values
ggplot(BPRS, aes(x = week, y = Fitted, group = subject)) +
  geom_line(aes(linetype = subject)) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  scale_y_continuous(name = "Fitted psychiatric scale values") +
  theme(legend.position = "top")
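Finally, the observed and fitted trajectories can be compared side by side by gathering the two columns into long form and faceting (a sketch assuming `tidyr` >= 1.0 is available and the `Fitted` column created above exists):

```r
# Sketch: observed vs. fitted bprs trajectories in one faceted plot
library(tidyr)
BPRS %>%
  pivot_longer(c(bprs, Fitted), names_to = "type", values_to = "value") %>%
  ggplot(aes(x = week, y = value, group = subject)) +
  geom_line(aes(linetype = subject)) +
  facet_wrap(~ type) +
  scale_x_continuous(name = "Time (weeks)", breaks = seq(0, 8, 1)) +
  theme(legend.position = "none")
```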